Abstract

We are interested in the following version of Jeffreys's law: if two
predictors are predicting the same sequence of events and either is doing a
satisfactory job, they will make similar predictions in the long run. We
give a classification of instances of Jeffreys's law, illustrated with
examples.

1 Introduction 

In this paper we are interested in games of prediction for which Jeffreys's law, 
as stated in the abstract, holds. Specific true instances of Jeffreys's law will be 
referred to as Jeffreys theorems. 

In Section 2 we define several popular games of prediction and state Jeffreys
theorems for the absolute-loss, square-loss, and bounded square-loss games.
These results serve as illustrations for our taxonomy of Jeffreys theorems;
namely, we distinguish between Jeffreys theorems of level 1 (weakest), level
2 (intermediate), and level 3 (strongest).

In Section 3 we show that in the case of so-called perfectly mixable games
there is no difference between the three levels of Jeffreys theorems. Perfectly
mixable games include, in particular, log-loss games and the bounded
square-loss game.

In the next section, Section 4, we state level 2 Jeffreys theorems, which cover
the log-loss and square-loss games (not necessarily bounded). In combination
with the results of Section 3 this provides us with examples of level 3 Jeffreys
theorems. Some of the results in Section 4 are explicit inequalities, not just
statements of convergence.

The simple method of Section 4 does not work for the absolute-loss game.
In Section 5 we will see that it is still possible to prove a Jeffreys theorem for
this game, albeit only a level 1 one.

Perhaps the first instance of Jeffreys's law was proved by Blackwell and
Dubins [2]; a pointwise version of their result was established in [3]. Results
similar to ours but stated in terms of the algorithmic theory of randomness
were earlier obtained in (developing [6]) and [5] in the case of the log-loss
game, and in [11] (in essence developing [S]) in the case of the bounded
square-loss game.

2 Taxonomy and examples of Jeffreys theorems 

A game of prediction is a triple $(\Omega,\Gamma,\ell)$, where $\Omega$ and
$\Gamma$ are arbitrary sets, called the outcome space and prediction space,
respectively, and $\ell:\Omega\times\Gamma\to\mathbb{R}$ is called the loss
function. The game is played according to the following perfect-information
protocol.

Competitive prediction protocol

Players: Nature, Predictor 1, Predictor 2, Sceptic

Protocol:

FOR $n = 1, 2, \ldots$:

    Predictor 1 and Predictor 2 announce $\gamma_n^{[1]}\in\Gamma$ and $\gamma_n^{[2]}\in\Gamma$.

    Sceptic announces $\gamma_n\in\Gamma$.

    Nature announces $\omega_n\in\Omega$.

END FOR

Three of the players, two Predictors and one Sceptic, are trying to predict the
outcome $\omega_n$ to be announced by Nature. Sceptic is just like another
Predictor, but he will be playing a special role in our story. At step $n$,
Predictor 1 and Predictor 2 issue predictions $\gamma_n^{[1]}$ and
$\gamma_n^{[2]}$, respectively. The Predictors can consult each other when
making the predictions, and the pair $(\gamma_n^{[1]},\gamma_n^{[2]})$ can be
regarded as their joint prediction. After the two Predictors have announced
their predictions, Sceptic issues his own prediction $\gamma_n$. Then Nature
produces $\omega_n$. Let
$L_N^{[k]} := \sum_{n=1}^{N} \ell(\omega_n,\gamma_n^{[k]})$ be the cumulative
loss to time $N$ of Predictor $k$, $k = 1, 2$, and similarly $L_N$ for Sceptic.
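The order of moves and the bookkeeping of cumulative losses can be sketched in code. The following is only an illustration under stated assumptions: the function name `play`, the callback signatures, and the toy strategies are hypothetical and not part of the paper; outcomes and predictions are taken to be real numbers.

```python
from typing import Callable, List, Tuple

# One round of the protocol: (gamma1, gamma2, gamma, omega).
Step = Tuple[float, float, float, float]


def play(
    loss: Callable[[float, float], float],
    predictor1: Callable[[List[Step]], float],
    predictor2: Callable[[List[Step]], float],
    sceptic: Callable[[List[Step], float, float], float],
    nature: Callable[[List[Step], float, float, float], float],
    n_steps: int,
) -> Tuple[float, float, float]:
    """Run the protocol for n_steps; return cumulative losses (L1, L2, L)."""
    history: List[Step] = []
    L1 = L2 = L = 0.0
    for _ in range(n_steps):
        g1 = predictor1(history)        # Predictor 1 announces gamma_n^[1]
        g2 = predictor2(history)        # Predictor 2 announces gamma_n^[2]
        g = sceptic(history, g1, g2)    # Sceptic moves after both Predictors
        w = nature(history, g1, g2, g)  # Nature moves last
        L1 += loss(w, g1)
        L2 += loss(w, g2)
        L += loss(w, g)
        history.append((g1, g2, g, w))
    return L1, L2, L


def absolute_loss(w: float, g: float) -> float:
    return abs(w - g)


# Toy run (hypothetical strategies): constant Predictors, Sceptic averaging
# them, Nature alternating between 0 and 1.
losses = play(
    absolute_loss,
    lambda h: 0.0,
    lambda h: 1.0,
    lambda h, g1, g2: (g1 + g2) / 2,
    lambda h, g1, g2, g: float(len(h) % 2),
    n_steps=4,
)
```

Passing the history to every player reflects the perfect-information assumption; Sceptic additionally receives the two current predictions because he moves after the Predictors.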

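The absolute-loss game is $(\mathbb{R},\mathbb{R},\ell)$ where
$\ell(\omega,\gamma) := |\omega-\gamma|$. The next proposition states our first
Jeffreys theorem.

Proposition 1. Sceptic has a strategy in the absolute-loss game that guarantees

$$\lim_{N\to\infty} \max\left( \frac{1}{\left|\gamma_N^{[1]}-\gamma_N^{[2]}\right|},\; L_N^{[1]}-L_N,\; L_N^{[2]}-L_N \right) = \infty. \tag{1}$$

As usual, we set $1/0 := \infty$ in (1). For the proof of Proposition 1 see Section 5.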
We call (1), perhaps with $|\gamma_N^{[1]}-\gamma_N^{[2]}|$ replaced by a
different distance, a level 1 Jeffreys theorem. It says that for a sufficiently
distant outcome $\omega_N$, $N \gg 1$, at least one of the following three
things happens: the two Predictors' predictions $\gamma_N^{[1]}$ and
$\gamma_N^{[2]}$ are close to each other; Sceptic greatly outperforms
Predictor 1 by time $N$; Sceptic greatly outperforms Predictor 2 by time $N$.
The weakness of this statement is that no "stabilization" is guaranteed along a
given infinite sequence of outcomes $\omega_1\omega_2\ldots$: it is possible
that each one of the three terms of the disjunction will be violated infinitely
often.
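A stronger Jeffreys theorem, which we call a level 2 Jeffreys theorem, would
say that

$$\lim_{n\to\infty}\left|\gamma_n^{[1]}-\gamma_n^{[2]}\right| = 0 \quad\text{or}\quad \limsup_{N\to\infty}\left(L_N^{[1]}-L_N\right) = \infty \quad\text{or}\quad \limsup_{N\to\infty}\left(L_N^{[2]}-L_N\right) = \infty. \tag{2}$$

An even stronger statement, which we call a level 3 Jeffreys theorem, would be

$$\lim_{n\to\infty}\left|\gamma_n^{[1]}-\gamma_n^{[2]}\right| = 0 \quad\text{or}\quad \lim_{N\to\infty}\left(L_N^{[1]}-L_N\right) = \infty \quad\text{or}\quad \lim_{N\to\infty}\left(L_N^{[2]}-L_N\right) = \infty. \tag{3}$$
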
The following two propositions give examples of level 2 and level 3 Jeffreys
theorems. The square-loss game is $(\mathbb{R},\mathbb{R},\ell)$ where
$\ell(\omega,\gamma) := (\omega-\gamma)^2$.

Proposition 2. Sceptic has a strategy in the square-loss game that
guarantees (2).

The bounded square-loss game is $([0,1],[0,1],\ell)$ where
$\ell(\omega,\gamma) := (\omega-\gamma)^2$. (We fix specific bounds, 0 and 1,
for outcomes and predictions, but our results generalize in a straightforward
manner to any other bounds.)

Proposition 3. Sceptic has a strategy in the bounded square-loss game that
guarantees (3).

Proposition 2 will be proved in Section 4, and it will imply Proposition 3 in
combination with results of Section 3.



Counterexample 

The bounded absolute-loss game is $([0,1],[0,1],\ell)$ where
$\ell(\omega,\gamma) := |\omega-\gamma|$. The level 3 Jeffreys theorem does not
hold for the bounded absolute-loss game:

Proposition 4. Sceptic does not have a strategy that guarantees (3) in the
bounded absolute-loss game.

Proof. Suppose Sceptic has such a strategy and is playing it. Let Nature produce
0 and 1 independently with probability 1/2 each. Predictor 1 always predicts
0 and Predictor 2 always predicts 1. The restriction of Sceptic's strategy to
$\omega_n\in\{0,1\}$ and $\gamma_n^{[1]},\gamma_n^{[2]}\in\{0,1\}$ is
automatically measurable. We can see that $L_N^{[1]}-L_N$ and $L_N^{[2]}-L_N$
are martingales with bounded increments, and so tend to $\infty$ with
probability zero (see [7], Theorem VII.5.1 and its corollary). Since the two
Predictors' predictions always differ by 1, (3) happens with probability
zero. □

The proof shows that Proposition 4 remains true for the restricted game
$(\{0,1\},[0,1],\ell)$, $\ell(\omega,\gamma) := |\omega-\gamma|$.
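The martingale property used in the proof of Proposition 4 comes down to a one-line computation: under a fair coin on $\{0,1\}$, every prediction in $[0,1]$ has expected absolute loss exactly 1/2, so the expected increment of a Predictor's loss minus Sceptic's loss is zero whatever Sceptic predicts. A minimal numerical check (the function name is hypothetical, introduced here for illustration only):

```python
def expected_increment(gamma: float, predictor: float) -> float:
    """E[|w - predictor| - |w - gamma|] for w uniform on {0, 1}.

    `predictor` is the constant prediction (0 or 1) of one of the Predictors
    and `gamma` is any prediction of Sceptic in [0, 1].
    """
    return sum(0.5 * (abs(w - predictor) - abs(w - gamma)) for w in (0, 1))


# Sceptic's predictions on a grid of [0, 1], against both constant Predictors.
increments = [
    expected_increment(i / 10, p) for i in range(11) for p in (0.0, 1.0)
]
```

All entries of `increments` vanish up to rounding, which is why neither loss difference can drift to infinity except on an event of probability zero.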
3 Reductions between Jeffreys theorems

It appears that the main factor that determines which Jeffreys theorems hold for
a particular game of prediction is the degree of convexity of the game. We might
define a game to be convex if its prediction set $\Gamma$ is a convex set in a
linear space and its loss function $\ell(\omega,\gamma)$ is convex in
$\gamma\in\Gamma$. However, this definition would be too narrow, since the
predictions $\gamma$ are usually just arbitrary labels. We start by introducing
a much less arbitrary representation of games of prediction.
A canonical prediction is a function $\lambda:\Omega\to\mathbb{R}$ such that

$$\exists\gamma\in\Gamma\;\forall\omega\in\Omega:\quad \lambda(\omega) = \ell(\omega,\gamma).$$

The canonical representation of the game $(\Omega,\Gamma,\ell)$ is the pair
$(\Omega,\Lambda)$ where $\Lambda$, called the canonical prediction set, is the
set of all canonical predictions. We will not always distinguish between the
game and its canonical representation and will usually consider games that are
non-redundant in the sense that

$$(\lambda_1,\lambda_2\in\Lambda \;\&\; \lambda_1\le\lambda_2) \Longrightarrow \lambda_1 = \lambda_2. \tag{4}$$

A superprediction (resp. subprediction) is a function
$\lambda:\Omega\to\mathbb{R}$ such that $\lambda\ge\lambda'$ (resp.
$\lambda\le\lambda'$) for some canonical prediction $\lambda'$. The set of all
superpredictions (resp. subpredictions) will be denoted $\overline{\Lambda}$
(resp. $\underline{\Lambda}$) and called the superprediction set (resp.
subprediction set).

We will be interested in three notions of convexity for games of prediction: 

• a game is convex if its superprediction set $\overline{\Lambda}$ is convex
(equivalently, if a convex mixture of two canonical predictions is always a
superprediction); this condition is always satisfied if $\Gamma$ is a convex
set and the loss function $\ell(\omega,\gamma)$ is convex in $\gamma\in\Gamma$;

• a game is strictly convex if a non-degenerate convex mixture of two
canonical predictions is always an interior point of $\overline{\Lambda}$ (in
the topology of uniform convergence);

• a game is perfectly mixable if, for some $\eta>0$, the set
$e^{-\eta\overline{\Lambda}} := \{e^{-\eta\lambda} : \lambda\in\overline{\Lambda}\}$
is convex.
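Perfect mixability can be probed numerically in the binary case. The sketch below relies on the usual reformulation of the definition, assumed here rather than taken from the text above: the set $e^{-\eta\overline{\Lambda}}$ is convex exactly when every "generalized prediction" $\omega\mapsto-\frac{1}{\eta}\ln\sum_i p_i e^{-\eta\ell(\omega,\gamma_i)}$ is a superprediction. It assumes outcomes in $\{0,1\}$, predictions in $[0,1]$, mixtures of two predictions only, and the value $\eta=2$ commonly quoted for the binary square-loss game; all function names are hypothetical.

```python
import math
from typing import Callable, Sequence


def generalized_prediction(
    loss: Callable[[float, float], float],
    eta: float,
    preds: Sequence[float],
    weights: Sequence[float],
    w: float,
) -> float:
    """-(1/eta) * ln(sum_i p_i * exp(-eta * loss(w, gamma_i)))."""
    mix = sum(p * math.exp(-eta * loss(w, g)) for g, p in zip(preds, weights))
    # Clamp at 0 to guard against tiny negative values caused by rounding.
    return max(0.0, -math.log(mix) / eta)


def square_loss(w: float, g: float) -> float:
    return (w - g) ** 2


def absolute_loss(w: float, g: float) -> float:
    return abs(w - g)


def square_mixture_is_superprediction(
    eta: float, g1: float, g2: float, p: float, tol: float = 1e-9
) -> bool:
    """Binary square loss: is the generalized prediction a superprediction?"""
    preds, weights = (g1, g2), (p, 1.0 - p)
    a0 = generalized_prediction(square_loss, eta, preds, weights, 0.0)
    a1 = generalized_prediction(square_loss, eta, preds, weights, 1.0)
    # Some gamma in [0, 1] satisfies gamma^2 <= a0 and (1 - gamma)^2 <= a1
    # exactly when sqrt(a0) + sqrt(a1) >= 1.
    return math.sqrt(a0) + math.sqrt(a1) >= 1.0 - tol


def absolute_mixture_is_superprediction(
    eta: float, g1: float, g2: float, p: float, tol: float = 1e-9
) -> bool:
    """Binary absolute loss: is the generalized prediction a superprediction?"""
    preds, weights = (g1, g2), (p, 1.0 - p)
    a0 = generalized_prediction(absolute_loss, eta, preds, weights, 0.0)
    a1 = generalized_prediction(absolute_loss, eta, preds, weights, 1.0)
    # Some gamma in [0, 1] satisfies gamma <= a0 and 1 - gamma <= a1
    # exactly when a0 + a1 >= 1.
    return a0 + a1 >= 1.0 - tol
```

On a grid of predictions and weights the square-loss check succeeds for $\eta=2$, whereas the absolute-loss check already fails for the equal-weight mixture of the predictions 0 and 1, in line with the different behaviour of the two games described in this paper.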

For illustrative purposes it is convenient to consider the case where the game
$(\Omega,\Gamma,\ell)$ is binary, in the sense $\Omega = \{0,1\}$. In this case
$\Lambda$ can be represented as the subset of $\mathbb{R}^2$ consisting of the
points $(x,y) = (\lambda(0),\lambda(1))$ where $\lambda$ ranges over $\Lambda$.
An example is given as the curved line in Figure 1 below; the superpredictions
are the points North-East of the line, and the subpredictions are the points
South-West of the line.

It is easy to see that for perfectly mixable prediction games there is no real 
difference between the three levels of Jeffreys theorems: 

Proposition 5. Suppose Sceptic can guarantee (1) in the competitive prediction
protocol for a perfectly mixable game. Then he can also guarantee (3) (and, a
fortiori, (2)).

Proof. Consider the generalization of the competitive prediction protocol in
which there are infinitely many Predictors (called Experts and numbered by
$k = 1, 2, \ldots$) instead of just two. Using the Aggregating Algorithm (see,
e.g., [10], Subsection 2.1), for any sequence $p_1, p_2, \ldots$ of positive
weights summing to 1 Sceptic can guarantee that his loss satisfies

$$L_N \le L_N^{(k)} + C\ln\frac{1}{p_k} \tag{5}$$

for all $N = 1, 2, \ldots$ and $k = 1, 2, \ldots$, where $L_N^{(k)}$ is the
cumulative loss of Expert $k$ and $C$ is a constant depending on the
prediction game.

Let Sceptic play a strategy that guarantees (1). We will construct a new
strategy for Sceptic that guarantees (3). Consider the following doubly infinite
set of experts:

• Expert $(k,1)$, $k = 1, 2, \ldots$, plays as Sceptic until the difference
$L_n^{[1]}-L_n$ exceeds $2^k$; as soon as this happens (if it ever happens), he
starts playing as Predictor 1;

• Expert $(k,2)$ plays as Sceptic until the difference $L_n^{[2]}-L_n$ exceeds
$2^k$; as soon as this happens, he starts playing as Predictor 2.

The weights $p_{k,1}$ and $p_{k,2}$ assigned to these experts are
$p_{k,1} = p_{k,2} = 2^{-k-1}$. Applied to these experts, the Aggregating
Algorithm provides a new strategy for Sceptic that guarantees (3). Indeed,
suppose the first of the three terms in (3) is false. Then, by (1), either the
second or the third term in (3) becomes true when lim is replaced by limsup.
Suppose, for concreteness, it is the second term. For each $k$, Expert
$(k,1)$'s loss satisfies $L_N^{(k,1)} \le L_N^{[1]} - 2^k$ from some $N$ on, and
so (5) implies that the Aggregating Algorithm's loss $\tilde L_N$ satisfies

$$\tilde L_N \le L_N^{(k,1)} + C\ln\frac{1}{p_{k,1}} \le L_N^{[1]} - 2^k + (C\ln 2)(k+1)$$

for all $k$ and from some $N$ on. Letting $k\to\infty$, we can see that the
second term of (3), with $\tilde L_N$ in place of $L_N$, is true. □

Of course, Proposition 5 will continue to hold if the Euclidean distance in
(1), (2), and (3) is replaced by any other distance.

Examples of perfectly mixable games 

The bounded square-loss game is perfectly mixable ([10], Subsection 2.4).

Perhaps the most fundamental class of games of prediction is that of log-loss
games. If $(\Omega,\Gamma,\ell)$ is a log-loss game, $\Omega$ is a measurable
space with a fixed $\sigma$-finite measure $\mu$ (more generally,
$\mu = \mu_n$ may depend on $n$ and be announced by a player, say Nature, at
the beginning of step $n$ of the game), $\Gamma$ is the set of all measurable
functions $\gamma:\Omega\to[0,\infty)$ satisfying $\int\gamma\,d\mu = 1$, and
$\ell(\omega,\gamma) := -\ln\gamma(\omega)$. For log-loss games the loss
function is allowed to take value $\infty$ ($-\ln 0 := \infty$). A simple and
instructive special case to keep in mind is where $\mu$ is the counting measure
on a countable $\Omega$. The perfect mixability of log-loss games is a
well-known fact, and the Aggregating Algorithm for them reduces to the Bayes
rule (for details see, e.g., [10], Subsection 2.2).
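For a finite outcome set and finitely many experts, the Bayes rule mentioned here can be sketched as follows. This is an illustration under stated assumptions rather than the construction of the cited reference: the function name and the data layout are hypothetical, and the constant in the bound $L_N \le L_N^{(k)} + C\ln(1/p_k)$ is taken to be $C = 1$, the standard value for the Bayesian mixture under log loss.

```python
import math
from typing import List, Sequence, Tuple


def bayes_mixture_losses(
    expert_probs: Sequence[Sequence[Sequence[float]]],
    outcomes: Sequence[int],
    prior: Sequence[float],
) -> Tuple[float, List[float]]:
    """Cumulative log losses of the Bayes mixture and of each expert.

    expert_probs[k][n][w] is the probability that expert k assigns to
    outcome w at step n; prior[k] is the initial weight p_k of expert k.
    """
    K = len(prior)
    posterior = list(prior)
    mixture_loss = 0.0
    expert_losses = [0.0] * K
    for n, w in enumerate(outcomes):
        # Sceptic's prediction is the posterior-weighted mixture.
        mix = sum(posterior[k] * expert_probs[k][n][w] for k in range(K))
        mixture_loss += -math.log(mix)
        for k in range(K):
            expert_losses[k] += -math.log(expert_probs[k][n][w])
        # Bayes rule: reweight each expert by the likelihood of the outcome.
        posterior = [posterior[k] * expert_probs[k][n][w] / mix for k in range(K)]
    return mixture_loss, expert_losses


# Two hypothetical experts on binary outcomes, with equal prior weights.
outcomes = [0, 0, 1, 0]
confident = [[0.9, 0.1]] * len(outcomes)
uniform = [[0.5, 0.5]] * len(outcomes)
prior = [0.5, 0.5]
L, Lk = bayes_mixture_losses([confident, uniform], outcomes, prior)
```

The mixture's cumulative loss `L` never exceeds `Lk[k] + ln(1 / prior[k])` for either expert, because the mixture probability of the whole outcome sequence is at least `prior[k]` times expert `k`'s probability of it.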

For other examples of perfectly mixable games (such as the Kullback-Leibler
game and Cover's game), see [10], Subsection 2.5.

4 Level 2 Jeffreys theorems 

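If $\lambda_1$ and $\lambda_2$ are canonical predictions and
$\alpha\in(-1,1)$, we set

$$\underline{D}^{[\alpha]}(\lambda_1\,\|\,\lambda_2) := \frac{4}{1-\alpha^2}\sup\left\{t\in\mathbb{R} : \frac{1-\alpha}{2}\lambda_1 + \frac{1+\alpha}{2}\lambda_2 - t \in \overline{\Lambda}\right\} \tag{6}$$

(the lower $\alpha$-divergence between $\lambda_1$ and $\lambda_2$) and

$$\overline{D}^{[\alpha]}(\lambda_1\,\|\,\lambda_2) := \frac{4}{1-\alpha^2}\inf\left\{t\in\mathbb{R} : \frac{1-\alpha}{2}\lambda_1 + \frac{1+\alpha}{2}\lambda_2 - t \in \underline{\Lambda}\right\}$$

(the upper $\alpha$-divergence between $\lambda_1$ and $\lambda_2$). The lower
and upper divergences may take values $-\infty$ or $\infty$. We will be mostly
interested in lower divergences (which for many interesting games coincide with
upper divergences). In the case of binary $(\Omega,\Gamma,\ell)$ this definition
is illustrated in Figure 1 (notice that the difference between lower and upper
$\alpha$-divergences disappears for convex binary games; in such cases, we will
sometimes write $D^{[\alpha]}(\lambda_1\,\|\,\lambda_2)$ for the common value of
$\underline{D}^{[\alpha]}(\lambda_1\,\|\,\lambda_2)$ and
$\overline{D}^{[\alpha]}(\lambda_1\,\|\,\lambda_2)$ and omit the adjectives
"lower" and "upper"). We will also write
$\underline{D}^{[\alpha]}(\gamma_1\,\|\,\gamma_2)$ and
$\overline{D}^{[\alpha]}(\gamma_1\,\|\,\gamma_2)$ for
$\gamma_1,\gamma_2\in\Gamma$, in the obvious sense.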

Notice that, for strictly convex and non-redundant (in the sense of (4))
games,

$$\overline{D}^{[\alpha]}(\lambda_1\,\|\,\lambda_2) \ge \underline{D}^{[\alpha]}(\lambda_1\,\|\,\lambda_2) \ge 0,$$

for all $\lambda_1,\lambda_2\in\Lambda$. For $\alpha = 0$ the lower (resp.
upper) $\alpha$-divergence is called the lower (resp. upper) Hellinger
distance; the word "distance" is partly explained by its symmetry (although
simplest examples show that there is no continuous function $f$ such that
$f(\underline{D}^{[0]})$ or $f(\overline{D}^{[0]})$ is a metric for every
strictly convex game).

The values of lower and upper $\alpha$-divergences for $\alpha = \pm1$ are
defined as their limits as $\alpha\to\pm1$ when those limits exist. The lower
(resp. upper) $-1$-divergence is called the lower (resp. upper)
Kullback-Leibler divergence and is especially important.

Remark. It is not difficult to see that upper divergences can be very different 
from the corresponding lower divergences even for "nice" (in particular, strictly 
convex) games. For example, for the game ([—1, 1], [—1, 1], (w — 7)^) the lower 
and upper Hellinger distances between the predictions —1 and 1 are different, 1 
and 7. (Cf. [5, Lemma 3.) 
Figure 1: The interpretation of the $\alpha$-divergence between canonical
predictions $\lambda_1$ and $\lambda_2$ in the binary case: find the mean
$\frac{1-\alpha}{2}\lambda_1 + \frac{1+\alpha}{2}\lambda_2$ of $\lambda_1$ and
$\lambda_2$; find the intersection $\lambda$ of the prediction set and the
slope 1 line passing through the mean; multiply the horizontal (=vertical)
distance between the mean and $\lambda$ by $\frac{4}{1-\alpha^2}$.

The square-loss and log-loss games

In this subsection we will compute lower and upper divergences for two popular
games of prediction defined earlier.



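Lemma 1. In the square-loss game,

$$D^{[\alpha]}(\gamma_1\,\|\,\gamma_2) = (\gamma_1-\gamma_2)^2 \tag{7}$$

for all $\alpha\in[-1,1]$ and $\gamma_1,\gamma_2\in\mathbb{R}$.

Proof. It suffices to consider the case $\alpha\in(-1,1)$. The statement of the
lemma will follow from the fact that, for all $\omega\in\mathbb{R}$,

$$\frac{1-\alpha}{2}(\gamma_1-\omega)^2 + \frac{1+\alpha}{2}(\gamma_2-\omega)^2 - \frac{1-\alpha^2}{4}(\gamma_1-\gamma_2)^2 = \left(\frac{1-\alpha}{2}\gamma_1 + \frac{1+\alpha}{2}\gamma_2 - \omega\right)^2.$$

If we set $t_1 := \gamma_1-\omega$ and $t_2 := \gamma_2-\omega$, the last
equality simplifies to the obvious

$$\frac{1-\alpha}{2}t_1^2 + \frac{1+\alpha}{2}t_2^2 - \frac{1-\alpha^2}{4}(t_1-t_2)^2 = \left(\frac{1-\alpha}{2}t_1 + \frac{1+\alpha}{2}t_2\right)^2.$$

□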
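The identity behind Lemma 1 is easy to confirm numerically. The sketch below (function names hypothetical) computes the gap between the weighted mean of the two Predictors' square losses and the square loss of the correspondingly weighted prediction, and rescales it by $4/(1-\alpha^2)$; the result should not depend on the outcome $\omega$ and should equal $(\gamma_1-\gamma_2)^2$.

```python
def mixture_gap(alpha: float, g1: float, g2: float, w: float) -> float:
    """Mean of the two square losses minus the loss of the mixed prediction."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    mean_loss = a * (w - g1) ** 2 + b * (w - g2) ** 2
    mixed_prediction_loss = (w - (a * g1 + b * g2)) ** 2
    return mean_loss - mixed_prediction_loss


def square_loss_divergence(alpha: float, g1: float, g2: float, w: float = 0.0) -> float:
    """alpha-divergence in the square-loss game, evaluated at the outcome w."""
    return 4 / (1 - alpha ** 2) * mixture_gap(alpha, g1, g2, w)


# The value is the same for every outcome w and every alpha in (-1, 1).
values = [
    square_loss_divergence(alpha, 1.0, 4.0, w)
    for alpha in (-0.9, -0.3, 0.0, 0.5)
    for w in (-2.0, 0.0, 1.5, 10.0)
]
```

Every entry of `values` agrees with $(1-4)^2 = 9$ up to rounding error.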
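Lemma 2. In any log-loss game,

$$D^{[\alpha]}(\gamma_1\,\|\,\gamma_2) = -\frac{4}{1-\alpha^2}\ln\int_\Omega (\gamma_1(\omega))^{\frac{1-\alpha}{2}}(\gamma_2(\omega))^{\frac{1+\alpha}{2}}\,\mu(d\omega) \tag{8}$$

for all $\alpha\in(-1,1)$ and $\gamma_1,\gamma_2\in\Gamma$.

Proof. The left-hand side of (8) can be written as $\frac{4}{1-\alpha^2}t$,
where $t$ is defined from the condition that, for some $\gamma\in\Gamma$ and
all $\omega\in\Omega$,

$$-\frac{1-\alpha}{2}\ln\gamma_1(\omega) - \frac{1+\alpha}{2}\ln\gamma_2(\omega) - t = -\ln\gamma(\omega).$$

Deducing

$$\int_\Omega\gamma\,d\mu = \int_\Omega (\gamma_1(\omega))^{\frac{1-\alpha}{2}}(\gamma_2(\omega))^{\frac{1+\alpha}{2}}\,\mu(d\omega)\,e^{t},$$

substituting 1 for $\int\gamma\,d\mu$, and solving the resulting equation for
$t$, we obtain the statement of the lemma. □

The standard definition of the $\alpha$-divergence for the log-loss game (see,
e.g., [1], p. 57) is

$$D^{(\alpha)}(\gamma_1\,\|\,\gamma_2) = \frac{4}{1-\alpha^2}\left(1 - \int_\Omega (\gamma_1(\omega))^{\frac{1-\alpha}{2}}(\gamma_2(\omega))^{\frac{1+\alpha}{2}}\,\mu(d\omega)\right);$$

it is clear that this will differ little from (8) when $\gamma_1$ and
$\gamma_2$ are close in a suitable sense. The inequality $\ln x \le x-1$
implies $D^{(\alpha)} \le D^{[\alpha]}$.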
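For the counting measure on a finite outcome set both divergences reduce to finite sums, which makes the comparison easy to try out. The following sketch (hypothetical function names; probability vectors are assumed to have strictly positive entries) computes the divergence of Lemma 2 and the standard $\alpha$-divergence from the same integral.

```python
import math
from typing import Sequence


def affinity(alpha: float, p: Sequence[float], q: Sequence[float]) -> float:
    """Sum over outcomes of p^((1 - alpha) / 2) * q^((1 + alpha) / 2)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return sum(pi ** a * qi ** b for pi, qi in zip(p, q))


def log_loss_divergence(alpha: float, p: Sequence[float], q: Sequence[float]) -> float:
    """The divergence of Lemma 2: -4 / (1 - alpha^2) * ln(affinity)."""
    return -4 / (1 - alpha ** 2) * math.log(affinity(alpha, p, q))


def standard_divergence(alpha: float, p: Sequence[float], q: Sequence[float]) -> float:
    """The standard alpha-divergence: 4 / (1 - alpha^2) * (1 - affinity)."""
    return 4 / (1 - alpha ** 2) * (1 - affinity(alpha, p, q))


p, q = (0.5, 0.5), (0.9, 0.1)
pairs = [
    (standard_divergence(a, p, q), log_loss_divergence(a, p, q))
    for a in (-0.8, -0.2, 0.0, 0.6)
]
```

In each pair the standard divergence is the smaller of the two, as the inequality $\ln x \le x - 1$ predicts, and both vanish when the two distributions coincide.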

Level 2 and level 3 Jeffreys theorems 

This is our most general level 2 Jeffreys theorem: 

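Proposition 6. For each $\alpha\in(-1,1)$ and $\epsilon>0$, Sceptic has a
strategy that guarantees

$$\frac{1-\alpha^2}{4}\sum_{n=1}^{N}\underline{D}^{[\alpha]}\left(\gamma_n^{[1]}\,\|\,\gamma_n^{[2]}\right) \le \frac{1-\alpha}{2}L_N^{[1]} + \frac{1+\alpha}{2}L_N^{[2]} - L_N + \epsilon. \tag{9}$$

Proof. The strategy is obvious: according to (6), at step $n$ Sceptic can choose
a canonical prediction $\lambda$ satisfying

$$\lambda \le \frac{1-\alpha}{2}\lambda_1 + \frac{1+\alpha}{2}\lambda_2 - \frac{1-\alpha^2}{4}\underline{D}^{[\alpha]}(\lambda_1\,\|\,\lambda_2) + \epsilon 2^{-n}$$

($\lambda_1$ and $\lambda_2$ being the canonical predictions corresponding to
$\gamma_n^{[1]}$ and $\gamma_n^{[2]}$). Summing over the first $N$ steps, we
obtain (9). □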
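Specializing (9) to the case $\alpha = 0$ and the square-loss game gives

$$\frac14\sum_{n=1}^{N}\left(\gamma_n^{[1]}-\gamma_n^{[2]}\right)^2 \le \frac{L_N^{[1]}+L_N^{[2]}}{2} - L_N + \epsilon.$$

This implies a stronger version of the level 2 Jeffreys theorem (2):

$$\sum_{n=1}^{\infty}\left(\gamma_n^{[1]}-\gamma_n^{[2]}\right)^2 < \infty \quad\text{or}\quad \limsup_{N\to\infty}\max\left(L_N^{[1]}-L_N,\; L_N^{[2]}-L_N\right) = \infty.$$

In combination with the proof of Proposition 5 this implies the stronger form

$$\sum_{n=1}^{\infty}\left(\gamma_n^{[1]}-\gamma_n^{[2]}\right)^2 < \infty \quad\text{or}\quad \lim_{N\to\infty}\left(L_N^{[1]}-L_N\right) = \infty \quad\text{or}\quad \lim_{N\to\infty}\left(L_N^{[2]}-L_N\right) = \infty \tag{10}$$

of the level 3 Jeffreys theorem (3) for the bounded square-loss game.

For the log-loss game, we obtain (10) with the Hellinger distance
$D^{[0]}(\gamma_n^{[1]}\,\|\,\gamma_n^{[2]})$, or the standard Hellinger
distance $D^{(0)}(\gamma_n^{[1]}\,\|\,\gamma_n^{[2]})$, in place of
$(\gamma_n^{[1]}-\gamma_n^{[2]})^2$.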
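In the square-loss game the strategy behind Proposition 6 can be made completely explicit: one natural choice, assumed here for illustration, is for Sceptic to predict the weighted mean $\frac{1-\alpha}{2}\gamma_n^{[1]}+\frac{1+\alpha}{2}\gamma_n^{[2]}$, in which case the inequality of Proposition 6 holds with equality and $\epsilon = 0$. The sketch below (hypothetical function name, randomly generated data) compares the two sides.

```python
import random
from typing import Sequence, Tuple


def jeffreys_sides(
    alpha: float,
    preds1: Sequence[float],
    preds2: Sequence[float],
    outcomes: Sequence[float],
) -> Tuple[float, float]:
    """Return both sides of the inequality for the square-loss game.

    Sceptic is assumed to predict the weighted mean of the two Predictors.
    """
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    L1 = L2 = L = 0.0
    divergence_sum = 0.0
    for g1, g2, w in zip(preds1, preds2, outcomes):
        g = a * g1 + b * g2              # Sceptic's prediction
        L1 += (w - g1) ** 2
        L2 += (w - g2) ** 2
        L += (w - g) ** 2
        divergence_sum += (g1 - g2) ** 2  # divergence of Lemma 1
    lhs = (1 - alpha ** 2) / 4 * divergence_sum
    rhs = a * L1 + b * L2 - L
    return lhs, rhs


rng = random.Random(0)
n = 100
p1 = [rng.uniform(-1, 1) for _ in range(n)]
p2 = [rng.uniform(-1, 1) for _ in range(n)]
ws = [rng.uniform(-1, 1) for _ in range(n)]
lhs0, rhs0 = jeffreys_sides(0.0, p1, p2, ws)
lhs1, rhs1 = jeffreys_sides(0.4, p1, p2, ws)
```

With $\alpha = 0$ this says that a quarter of the accumulated squared disagreement between the Predictors is exactly the amount by which Sceptic's loss falls below the average of their losses, so the disagreement cannot accumulate without Sceptic pulling ahead of at least one Predictor.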



5 Level 1 Jeffreys theorems 

The main goal of this section is to prove Proposition 1. In the absolute-loss
game, the divergence between any two predictions is 0, and so the methods of
the previous section are not applicable.

First we describe a strategy for Sceptic that will later be shown to ensure
(1). Let $f:[0,\infty)\to[0,1/2)$ be a strictly increasing and concave function
satisfying $f(0) = 0$ and $f(\infty) < 1/2$; see Figure 2. Later it will be
convenient to extend $f$ to $(-\infty,\infty)$ by the central symmetry w.r. to
the origin $O$ (so that $f:(-\infty,\infty)\to(-1/2,1/2)$ is an odd function).

Suppose just before step $n = 1, 2, \ldots$ of the competitive prediction
protocol we have $D_{n-1} := L_{n-1}^{[1]} - L_{n-1}^{[2]} \ge 0$ (the case
where $L_{n-1}^{[1]} < L_{n-1}^{[2]}$ will later be reduced to this one).
Sceptic's move can be represented as

$$\gamma_n := (1-t_n)\gamma_n^{[1]} + t_n\gamma_n^{[2]},$$

where $t_n$ will be chosen later from the interval $[0, 1/2]$. Set
Figure 2: The function $f$ from the proof of Proposition 1

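If the actual outcome $\omega_n$ is in favour of Predictor 1, the difference
$L_n^{[1]} - L_n^{[2]}$ between the losses of the two Predictors will decrease
to $D_n = D_{n-1} - d_n$ and the difference $L_n - \bar L_n$ will increase by

$$\ell(\omega_n,\gamma_n) - \bar\ell_n = (1-t_n)\left(\bar\ell_n - \frac{d_n}{2}\right) + t_n\left(\bar\ell_n + \frac{d_n}{2}\right) - \bar\ell_n = -\left(\frac12 - t_n\right)d_n.$$

So in fact it will decrease as $t_n \le 1/2$. Let us set
$t_n := 1/2 - f(D_{n-1})$. The difference $L_n - \bar L_n$ will decrease by the
area of the rectangle $P_3P_5P_4P_1$.

If the actual outcome $\omega_n$ is in favour of Predictor 2, the difference
between the losses of the two Predictors will increase to
$D_n = D_{n-1} + d_n$ and the difference $L_n - \bar L_n$ will increase by

$$\ell(\omega_n,\gamma_n) - \bar\ell_n = (1-t_n)\left(\bar\ell_n + \frac{d_n}{2}\right) + t_n\left(\bar\ell_n - \frac{d_n}{2}\right) - \bar\ell_n = \left(\frac12 - t_n\right)d_n = f(D_{n-1})\,d_n,$$

i.e., by the area of the rectangle $P_5P_8P_6P_4$.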

We can see that in both cases, $D_n = D_{n-1} \pm d_n$, the difference
$L_n - \bar L_n$ increases by $\int_{D_{n-1}}^{D_n} f$ minus the area $A_n$ of
a curvilinear triangle ($P_1P_2P_4$ if $D_n = D_{n-1} - d_n$ and $P_4P_7P_6$
if $D_n = D_{n-1} + d_n$). Now extend $f$ to the whole of $(-\infty,\infty)$
as an odd function. Suppose that $D_{n-1} < 0$ and, moreover,
$D_{n-1} + d_n < 0$. Applying the same argument as above but with the roles of
Predictor 1 and Predictor 2 interchanged, we can see that the difference
$L_n - \bar L_n$ again increases by $\int_{D_{n-1}}^{D_n} f$ minus the area
$A_n$ of a curvilinear triangle. It is easy to check that the difference
$L_n - \bar L_n$ will change in the same way also in the case where
$D_{n-1} > 0$ but $D_{n-1} - d_n < 0$ and in the case where $D_{n-1} < 0$ but
$D_{n-1} + d_n > 0$. Since $L_N - \bar L_N$ is the cumulative increase in
$L_n - \bar L_n$ over $n = 1, \ldots, N$, we can see that

$$L_N - \bar L_N = \int_0^{D_N} f - \sum_{n=1}^{N} A_n.$$
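The strategy of this section is simple to simulate. The sketch below is an illustration under explicit assumptions, not the paper's own construction in every detail: $D_{n-1}$ is read as the loss of Predictor 1 minus the loss of Predictor 2, $t_n = 1/2 - f(D_{n-1})$ is the weight Sceptic puts on Predictor 2, and the particular odd function $f(x) = c\,x/(1+|x|)$ with $c = 0.4$ is a hypothetical choice that is strictly increasing, concave on $[0,\infty)$ and bounded by $c < 1/2$. Since the areas $A_n$ are nonnegative, the displayed formula suggests that Sceptic's loss never exceeds the average of the Predictors' losses plus $\int_0^{D_N} f$; the code checks this bound on random data.

```python
import math
import random
from typing import Sequence, Tuple

C = 0.4  # hypothetical value of f(infinity); must be below 1/2


def f(x: float) -> float:
    """Odd, strictly increasing, concave on [0, inf), with |f| < C."""
    return C * x / (1.0 + abs(x))


def F(x: float) -> float:
    """Integral of f from 0 to x (an even function of x)."""
    a = abs(x)
    return C * (a - math.log1p(a))


def run(
    preds1: Sequence[float], preds2: Sequence[float], outcomes: Sequence[float]
) -> Tuple[float, float, float]:
    """Absolute-loss game; return the cumulative losses (L1, L2, L)."""
    L1 = L2 = L = 0.0
    for g1, g2, w in zip(preds1, preds2, outcomes):
        D = L1 - L2                    # D_{n-1}: current lead of Predictor 2
        t = 0.5 - f(D)                 # weight on Predictor 2
        g = (1.0 - t) * g1 + t * g2    # Sceptic's prediction
        L1 += abs(w - g1)
        L2 += abs(w - g2)
        L += abs(w - g)
    return L1, L2, L


def bound_holds(L1: float, L2: float, L: float, tol: float = 1e-9) -> bool:
    """Check L <= (L1 + L2) / 2 + integral of f from 0 to L1 - L2."""
    return L <= (L1 + L2) / 2 + F(L1 - L2) + tol


rng = random.Random(1)
results = []
for _ in range(200):
    n = 50
    p1 = [rng.uniform(0, 1) for _ in range(n)]
    p2 = [rng.uniform(0, 1) for _ in range(n)]
    ws = [rng.uniform(0, 1) for _ in range(n)]
    results.append(run(p1, p2, ws))
```

Because $\int_0^{D} f \le C\,|D|$, the same bound gives $L_N \le \max(L_N^{[1]}, L_N^{[2]}) - (1/2 - C)\,|D_N|$: Sceptic beats the worse Predictor by a margin proportional to the gap between the two.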

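It remains to consider two cases:

$\sum_{n} A_n < \infty$: In this case, $A_n \to 0$ and so

$$\max\left(\frac{1}{\left|\gamma_N^{[1]}-\gamma_N^{[2]}\right|},\; |D_N|\right) \to \infty$$

as $N\to\infty$. The sequence $N = 1, 2, \ldots$ can be split into three
subsequences such that $|\gamma_N^{[1]}-\gamma_N^{[2]}| \to 0$ along the first,
$D_N \to \infty$ along the second, and $D_N \to -\infty$ along the third. It
suffices to show that (1) holds along the second subsequence (the case of the
third subsequence is analogous, and the case of the first subsequence is
trivial). Assuming $D_N > 0$, we can see that along the second subsequence:

$$L_N = \bar L_N + \int_0^{D_N} f - \sum_{n=1}^{N} A_n \le \frac{L_N^{[1]}+L_N^{[2]}}{2} + \int_0^{D_N} f \le L_N^{[1]} + D_N\left(f(\infty) - \frac12\right)$$

and so $L_N^{[1]} - L_N \to \infty$.

$\sum_{n} A_n = \infty$: In this case we have along the subsequence of $N$ for
which $D_N \ge 0$:

$$L_N = \bar L_N + \int_0^{D_N} f - \sum_{n=1}^{N} A_n = \frac{L_N^{[1]}+L_N^{[2]}}{2} + \int_0^{D_N} f - \sum_{n=1}^{N} A_n \le L_N^{[1]} - \sum_{n=1}^{N} A_n,$$

and so $L_N^{[1]} - L_N \to \infty$. Similarly, $L_N^{[2]} - L_N \to \infty$
along the subsequence of $N$ for which $D_N < 0$. Therefore, (1) holds.
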
Convex games

It is easy to see that the proof of Proposition 1 is applicable to any convex
game. For any such game Sceptic has a strategy in the competitive prediction
protocol that guarantees

$$\lim_{N\to\infty}\max\left(\frac{1}{\left|\ell(\omega_N,\gamma_N^{[1]})-\ell(\omega_N,\gamma_N^{[2]})\right|},\; L_N^{[1]}-L_N,\; L_N^{[2]}-L_N\right) = \infty.$$